Abstract

Support vector machines and kernel methods have recently gained considerable attention in chemoinformatics. They offer generally good performance for problems of supervised classification or regression, and provide a flexible and computationally efficient framework to include relevant information and prior knowledge about the data and problems to be handled. In particular, with kernel methods molecules do not need to be represented and stored explicitly as vectors or fingerprints, but only to be compared to each other through a comparison function technically called a kernel. While classical kernels can be used to compare vector or fingerprint representations of molecules, completely new kernels have been developed in recent years to directly compare the 2D or 3D structures of molecules, without the need for an explicit vectorization step through the extraction of molecular descriptors. While still in their infancy, these approaches have already demonstrated their relevance on several toxicity prediction and structure-activity relationship problems.



Introduction 

Computational approaches play an increasingly important role in modern drug discovery. In particular, accurate predictive models accounting for the biological activity and drug-likeness of candidate molecules can help in the identification of promising molecules and screening for various side-effects, leading to substantial savings in terms of time and costs for the development of new drugs. Such predictive models aim at inferring a relationship between the structure of a molecule and its biological and chemical properties, including toxicity, pharmacokinetics and activity against a target. The development of high-throughput technologies to assay such properties for large numbers of candidate molecules, and the subsequent availability of increasing quantities of molecules with characterized properties, has triggered the use of statistical and machine learning approaches to automatically learn the structure-property relationship from these pools of characterized molecules.

Decades of research in machine learning and statistics have provided a profusion of methods for that purpose, ranging from classical least-squares linear regression to artificial neural networks or decision trees (Hastie et al., 2001). While each method has its specificities, strengths and weaknesses, a common issue when one wants to infer a structure-property relationship concerns the way molecules are represented. While small molecules are often represented as 2D or 3D structures in chemistry and chemoinformatics, most statistical methods, including linear models and nonlinear neural networks, require vectors as input. Molecules must therefore first be encoded as finite-dimensional vectors, using various molecular descriptors, before being presented as input to these algorithms. The construction of molecular descriptors is however a difficult task. Often significant chemical expertise coupled with heuristic feature selection methods is needed to choose, among the plethora of possible molecular descriptors, the most relevant ones for a property to be predicted. The number of molecular descriptors must moreover be kept as small as possible to limit the complexity of the inference task.
An alternative has emerged recently with the advent of support vector machines (SVM) and related kernel methods in machine learning (Vapnik, 1998; Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004). SVM is an algorithm for pattern recognition and regression that provides a useful framework to overcome the difficulty of representing data as vectors of low dimension, both from a theoretical and a computational point of view. Theoretically, first, SVM are able to infer models in large or even infinite dimensions from a finite number of observations. Indeed the complexity of the learning task is not directly related to the dimension of the input vectors, but rather to some measure of complexity of the classification rules, which is precisely controlled by SVM through the use of regularization (Vapnik, 1998). Practically, second, a computational trick known as the kernel trick allows the estimation of models with a complexity that does not depend on the dimension of the input, but only on the number of training points. Hence training a model with vectors of infinite dimension is no more computationally demanding than training a model for small fingerprints, as long as the so-called kernel function, which corresponds to the inner product of the vectors, can be computed efficiently. Combined, these properties give SVM the ability to work with molecules represented by vectors of large or even infinite dimension in a computationally efficient framework, alleviating the burden of feature selection and giving modelers new opportunities to imagine large sets of molecular descriptors.

SVM often provide state-of-the-art performance on many classification and regression tasks, and therefore enjoy an increasing popularity in various application fields, including bioinformatics and chemoinformatics (Schölkopf et al., 2004). For example, SVM have been applied to the prediction of the activity of molecules on a number of target classes (Burbidge et al., 2001; Weston et al., 2003; Arimoto et al., 2005; Briem and Günther, 2005; Liu et al., 2004b; Saeh et al., 2005; Tobita et al., 2005), toxicological properties (Kramer et al., 2002; Helma et al., 2004; Luan et al., 2005), drug-likeness (Byvatov et al., 2003; Müller et al., 2005; Takaoka et al., 2003), blood-brain barrier permeability (Doniger et al., 2002), enantioselectivity (Aires-de-Sousa and Gasteiger, 2005), aqueous solubility (Lind and Maltseva, 2003), or isoelectric point (Liu et al., 2004a), to name just a few.



While most recent successful applications of SVM in chemoinformatics were obtained by just plugging classical molecular descriptors into the SVM, a growing line of work seeks to investigate the unique opportunities offered by SVM to go beyond classical fingerprints and molecular descriptors, thanks to the kernel trick. This avenue was pioneered simultaneously and independently by Kashima et al. (2003) and Gärtner et al. (2003), who proposed to represent the 2D structure of a molecule by an infinite-dimensional vector of linear fragment counts and showed how SVM can handle this representation with the kernel trick. Later work quickly refined these 2D kernels (Kashima et al., 2004; Mahé et al., 2005; Ralaivola et al., 2005) and proposed new infinite-dimensional representations of 3D structures (Swamidass et al., 2005; Mahé et al., 2006; Azencott et al., 2007).

These first attempts to enlarge the flexibility of molecular descriptor-based predictive models represent a promising direction for in silico modelling of structure-property relationships, because they illustrate the unique possibilities offered by SVM and more generally kernel methods in this context. We review them in this paper with the hope of offering a state-of-the-art description of the latest developments in this field, and an invitation for the chemoinformatics community to further investigate these possibilities. For that purpose we first provide a quick introduction to SVM and kernels in Section 1, and illustrate the relevance of the kernel trick when working with 2D structures of molecules with a simple example of a 2D kernel in Section 2. This example is further generalized and connected to recent work on 2D kernels in Section 3, and practical issues with these kernels are discussed in Section 4. In Section 5 we present another approach that focuses on the representation of 3D structures of molecules, and discuss practical issues for this approach in Section 6. We conclude with a discussion and suggestions for future work in Section 7.



1 Support vector machines and kernels 



SVM is a machine learning algorithm for pattern recognition originally developed in the early 1990s by V. Vapnik and coworkers (Boser et al., 1992; Vapnik, 1998). Although various extensions to multiclass classification, regression, outlier detection or feature construction also exist, we focus in this review on the simple pattern recognition problem and refer the interested reader to various textbooks to know more about these extensions, collectively known as kernel methods (Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004). A pattern recognition problem occurs when one is given a finite set of objects that belong to two possible classes, and must learn from this training set a rule to automatically predict the class of objects with unknown class. This general and abstract formulation in fact encompasses a number of practical situations in chemoinformatics and beyond. We focus here in particular on situations where the objects available are small molecules, and the classes to be predicted represent various properties of interest such as toxic/non-toxic, druggable/non-druggable, or inhibitor/non-inhibitor of a given target. Hence a typical pattern recognition problem could be, given a list of toxic and non-toxic molecules, to learn a rule to predict whether a new candidate molecule is toxic or not.

More formally, we represent the training set available as a set of $n$ objects $x_1, \ldots, x_n \in \mathcal{X}$, where $\mathcal{X}$ denotes the set of all possible objects, and associated binary labels $y_1, \ldots, y_n \in \{-1, 1\}$. In our case each object $x_i$ represents a molecule, $\mathcal{X}$ denotes the set of all possible molecules, and the two classes $1$ and $-1$ are arbitrary representations of two classes of interest, such as "toxic" and "non-toxic". Pattern recognition algorithms, such as SVM, use this training set to produce a classifier $f : \mathcal{X} \to \{-1, 1\}$ that can be used to predict the class of any new data $x \in \mathcal{X}$ by the value $f(x)$. When objects are $d$-dimensional vectors, that is, $\mathcal{X} = \mathbb{R}^d$, the classifier output by SVM is based on the sign of a linear function:

$$f(x) = \mathrm{sign}\left( \langle w, x \rangle + b \right), \qquad (1)$$

for some $(w, b) \in \mathbb{R}^d \times \mathbb{R}$ defined below. In this case the classifier has a geometric interpretation: the hyperplane $\langle w, x \rangle + b = 0$ separates the input space $\mathcal{X}$ into two half-spaces, and the predicted class of a new point depends on which side of the hyperplane it lies. The particular hyperplane selected by SVM is the one that solves the following optimization problem:

$$\min_{w, b} \left\{ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} L\left( y_i, \langle w, x_i \rangle + b \right) \right\}, \qquad (2)$$

where $C$ is a parameter and $L(y, t)$ is the hinge loss function, equal to $0$ if $yt \geq 1$ and to $1 - yt$ otherwise. For a given training example $(x_i, y_i)$, the hinge loss term $L(y_i, \langle w, x_i \rangle + b)$ quantifies how "good" the prediction $\langle w, x_i \rangle + b$ of a candidate classifier $(w, b)$ is, in the sense that the better the prediction, the smaller the loss. For example there is no loss when $w$ and $b$ are such that $y_i (\langle w, x_i \rangle + b) \geq 1$, which means that $\langle w, x_i \rangle + b$ has the sign of $y_i$ and is at least 1 in absolute value. In other words the loss is zero when the prediction is correct and made with large confidence. Now the second term in the sum (2) is proportional to the average loss over the training set of the candidate classifier $(w, b)$: it is small when the classifier fits the training points well, i.e., makes on average "good" predictions. On the other hand, the first term in (2), $\frac{1}{2}\|w\|^2$, is small when the slope of the classifier is small. The two terms in (2) are often in conflict, especially in large dimension, because it is often difficult to fit the training points well with linear functions of limited slope. The rationale behind the optimization problem (2) is indeed to find a linear classifier that reaches a trade-off between the goodness of fit on the training set (as quantified by the second term of this sum), and the smoothness of the classifier (as quantified by the first term). The parameter $C$ controls this trade-off, by balancing the importance of each term. In the extreme case when $C = +\infty$ and the training points can be correctly separated by a hyperplane, no error is allowed on the training set and the classifier with largest margin is found (Figure 1).
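To make the trade-off in problem (2) concrete, the following minimal sketch evaluates the objective for a candidate classifier $(w, b)$. It assumes that the regularization term is $\frac{1}{2}\|w\|^2$; the function names and the toy data are illustrative only and do not come from any SVM library.

```python
def hinge_loss(y, t):
    """Hinge loss L(y, t): 0 if y*t >= 1, and 1 - y*t otherwise."""
    return max(0.0, 1.0 - y * t)


def dot(u, v):
    """Inner product of two vectors given as sequences of floats."""
    return sum(a * b for a, b in zip(u, v))


def primal_objective(w, b, points, labels, C):
    """Value of 0.5*||w||^2 + C * (sum of hinge losses) for the classifier (w, b)."""
    regularizer = 0.5 * dot(w, w)
    total_loss = sum(hinge_loss(y, dot(w, x) + b) for x, y in zip(points, labels))
    return regularizer + C * total_loss


# Made-up training set: two points on each side of the line x1 = 0.
points = [(2.0, 0.0), (1.0, 1.0), (-2.0, 0.0), (-1.0, -1.0)]
labels = [1, 1, -1, -1]

# A steep classifier fits every point with margin at least 1 (zero loss) ...
steep = primal_objective((1.0, 0.0), 0.0, points, labels, C=1.0)
# ... while a flatter one has a smaller norm but pays some hinge loss.
flat = primal_objective((0.5, 0.0), 0.0, points, labels, C=1.0)
```

With these toy values the steep classifier has objective 0.5 (pure regularization) and the flat one 1.125; increasing $C$ penalizes the flat classifier further, which is the balancing role of $C$ described above.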

It is often useful to rewrite problem (2) in an equivalent way, using classical optimization theory. Indeed, this problem is equivalent to the following quadratic problem, called its dual:

$$\max_{\alpha \in \mathbb{R}^n} \left\{ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \right\} \qquad (3)$$

$$\text{subject to: } \sum_{i=1}^{n} \alpha_i y_i = 0 \text{ and } 0 \leq \alpha_i \leq C, \quad i \in [1 : n].$$

Both problems (2) and (3) are equivalent in the sense that the solution $(w^*, b^*)$ of the primal problem (2) can be deduced from the solution $\alpha^*$ of the dual problem (3). In particular, it can be shown that $w^* = \sum_{i=1}^{n} \alpha_i^* y_i x_i$, and $b^*$ can also be deduced from $\alpha^*$. As a result, the decision function (1) can also be expressed in terms of the solution $\alpha^*$ of the dual problem:

$$f(x) = \mathrm{sign}\left( \sum_{i=1}^{n} \alpha_i^* y_i \langle x_i, x \rangle + b^* \right). \qquad (4)$$

Figure 1: SVM estimates a linear separation between the classes. When the training patterns are linearly separable and the trade-off parameter $C$ in Equation (2) is set to $+\infty$, the separating hyperplane selected by SVM is the one that maximizes the distance to the closest point on each side ($\gamma$ on this picture). In general, some training points may be misclassified by the selected hyperplane to control overfitting.

Figure 2: In order to use SVM with molecules, we need to define an embedding of the space of molecules to a vector space, i.e., a representation of each molecule $x$ as a vector $\Phi(x)$. Note that, contrary to usual fingerprint-based approaches, the vector space might have a large or even infinite dimension.

Let us now consider the use of SVM for pattern recognition with molecules, represented for example by their 2D or 3D structures. Since such structures are not vectors, they cannot be directly input to SVM. Instead we need to embed the set $\mathcal{X}$ of 2D or 3D structures of molecules into a vector space $\mathcal{H}$ through a mapping $\Phi : \mathcal{X} \to \mathcal{H}$. We can then apply the SVM algorithm to the training vectors $\Phi(x_i), i = 1, \ldots, n$, as illustrated in Figure 2. An important point to notice is that in the dual formulation (3), the data are only present through dot-products: pairwise dot-products between the training points during the learning phase in (3), and dot-products between a new data point and the training points during the prediction phase in (4). This means that instead of explicitly knowing $\Phi(x)$ for any $x \in \mathcal{X}$, it suffices to be able to compute inner products of the form:

$$k(x, x') = \langle \Phi(x), \Phi(x') \rangle, \qquad (5)$$

for any $x, x' \in \mathcal{X}$. In that case the dual optimization problem (3) solved by SVM can be rewritten as follows:

$$\max_{\alpha \in \mathbb{R}^n} \left\{ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \right\} \qquad (6)$$

$$\text{subject to: } \sum_{i=1}^{n} \alpha_i y_i = 0 \text{ and } 0 \leq \alpha_i \leq C, \quad i \in [1 : n].$$

Moreover the classification function (4) becomes:

$$f(x) = \mathrm{sign}\left( \sum_{i=1}^{n} \alpha_i^* y_i k(x, x_i) + b^* \right). \qquad (7)$$

Hence we see that for both the training of the SVM (6) and the prediction of the class of new points (7), the feature map $\Phi$ only appears through the function $k$, which is called a kernel. Importantly, it is sometimes easier to compute the kernel $k(x, x')$ between two points directly than to compute their explicit representations as vectors in $\mathcal{H}$. In fact a classical result of Aronszajn (1950) characterizes all functions $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ that are valid kernels, i.e., for which there exists a feature space $\mathcal{H}$ and a mapping $\Phi : \mathcal{X} \to \mathcal{H}$ such that (5) holds (they constitute the so-called class of positive definite functions). Hence, with this characterization at hand, any kernel $k$ can be used with an SVM as long as it satisfies the positive definiteness property.

The formulation of SVM in terms of kernels (6)-(7) offers at least two major advantages over the formulation in terms of explicit vectors (3)-(4). First, it enables the straightforward extension of the linear SVM to nonlinear decision functions by using a nonlinear kernel, while keeping its nice properties intact (e.g., uniqueness of the solution, robustness to over-fitting, etc.). As an example, the Gaussian kernel $k(x, x') = \exp\left( -\|x - x'\|^2 / 2\sigma^2 \right)$ is positive definite and can therefore be used as a kernel in the SVM algorithm (6). Plugging this kernel into (7) we see that the resulting discrimination function has the form:

$$f(x) = \mathrm{sign}\left( \sum_{i=1}^{n} \alpha_i^* y_i \exp\left( -\frac{\|x - x_i\|^2}{2\sigma^2} \right) + b^* \right),$$

which is clearly a nonlinear function of $x$. Second, this formulation offers the possibility to directly apply SVM to non-vectorial data, such as 2D or 3D structures of molecules, provided a positive definite kernel function to compare these structures is defined. The definition of such structure kernels for molecules is explained in the following sections.
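As an illustration of how a kernel decision function of the form (7) is evaluated, here is a minimal sketch with the Gaussian kernel. It assumes that the labels $y_i$ enter the sum as in the expansion of $w^*$, and that the dual coefficients and offset are already known; the values used below are hypothetical and were not obtained by solving (6).

```python
import math


def gaussian_kernel(x, x_prime, sigma=1.0):
    """Gaussian kernel exp(-||x - x'||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def decision(x, train_points, labels, alphas, offset, kernel):
    """Sign of sum_i alpha_i * y_i * k(x, x_i) + offset (sign(0) is taken as +1)."""
    score = sum(a * y * kernel(x, xi)
                for a, y, xi in zip(alphas, labels, train_points))
    return 1 if score + offset >= 0 else -1


# Hypothetical trained model: one training point per class.
train_points = [(0.0, 0.0), (2.0, 2.0)]
labels = [1, -1]
alphas = [1.0, 1.0]   # illustrative values satisfying sum_i alpha_i * y_i = 0
offset = 0.0

near_positive = decision((0.1, 0.1), train_points, labels, alphas, offset, gaussian_kernel)
near_negative = decision((1.9, 2.1), train_points, labels, alphas, offset, gaussian_kernel)
```

Only kernel evaluations between the new point and the training points are needed; replacing `gaussian_kernel` by any other positive definite function, including one defined on molecular graphs, leaves `decision` unchanged.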



2 A simple kernel for 2D structures 

It is common to describe the 2D structure of a molecule as a labeled undirected graph $G = (V, E)$, with atoms as vertices $V$ and covalent bonds as edges $E$. Here we assume that a label is assigned to each node and edge, typically to describe the type of atoms and bonds involved. In order to train linear models for structure-property relationship prediction, each labeled graph $G$ representing a molecule must first be transformed into a vector $\Phi(G)$. In this section we describe a simple vector representation obtained by counting all walks of a given length $n$, and show the relevance of the kernel formulation in this case.

A walk of length $n$ on a graph is a sequence of $n$ vertices in which consecutive vertices are adjacent. We note that this definition allows a given vertex or edge to be present more than once in a walk. Clearly, the number of walks of length $n$ on a graph $G$ is finite, and we denote by $W_n(G)$ this set of walks in the following. By concatenating the labels of the vertices and of the edges of a walk $w$ we obtain a sequence of labels which we denote by $l(w)$, the label of the walk $w$. Moreover, we denote by $\mathcal{L}_n$ the set of possible labels for walks of length $n$, i.e., all possible sequences alternating $n$ vertex labels with $n - 1$ edge labels. Figure 3 illustrates these definitions. Now a simple way to represent a graph $G$ by a vector is to extract all walks of length $n$ from its structure, sort them by label, and count in $\Phi(G)$ the number of walks with each possible label in $\mathcal{L}_n$. In other words the dimension of $\Phi(G)$ is equal to the size of $\mathcal{L}_n$, and for each possible walk label $l \in \mathcal{L}_n$ we define the coordinate $\Phi_l(G)$ as the number of walks in $G$ having label $l$. More formally the feature $\Phi_l(G)$ is defined by:

$$\Phi_l(G) = \sum_{w \in W_n(G)} \mathbf{1}\left( l(w) = l \right). \qquad (8)$$
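The explicit representation (8) can be sketched as follows for small graphs. The encoding of a molecule as a dictionary of vertex labels plus a dictionary of labeled edges is an assumption made for this example, and the enumeration is deliberately naive: it lists every walk, so its cost grows quickly with $n$.

```python
from collections import Counter


def walk_label_counts(vertex_labels, edge_labels, n):
    """Count the walks of length n (n vertices) of a labeled graph, grouped by label.

    vertex_labels: dict vertex -> atom label
    edge_labels:   dict (u, v) -> bond label, one entry per undirected edge
    Returns a Counter mapping each walk label (a tuple alternating vertex
    and edge labels) to the number of walks carrying that label.
    """
    adjacency = {v: [] for v in vertex_labels}
    for (u, v), bond in edge_labels.items():
        adjacency[u].append((v, bond))
        adjacency[v].append((u, bond))
    # Walks of length 1 are single vertices; store (last vertex, label so far).
    walks = [(v, (vertex_labels[v],)) for v in vertex_labels]
    for _ in range(n - 1):
        walks = [(u, label + (bond, vertex_labels[u]))
                 for v, label in walks
                 for u, bond in adjacency[v]]
    return Counter(label for _, label in walks)


# Toy molecule skeleton C-C-O with single bonds (hydrogens omitted).
atoms = {0: "C", 1: "C", 2: "O"}
bonds = {(0, 1): "-", (1, 2): "-"}
features = walk_label_counts(atoms, bonds, 3)
```

On this toy graph there are six walks of length 3, two of which (0-1-0 and 1-0-1) share the label C-C-C, so the corresponding coordinate of $\Phi(G)$ equals 2.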

A direct approach to train a linear model with these vector representations would require the explicit computation and storage of $\Phi(G)$ for all graphs $G$ in a dataset. This approach becomes problematic when $n$ becomes large, because the number of walk labels increases exponentially with $n$. As an example, keeping only 6 types of atoms and 3 types of covalent bonds, the number of possible labels reaches 1,944 for walks of length 3; 34,992 for walks of length 4; 629,856 for walks of length 5; and more than 3 billion for walks of length 8. This explosion in the dimension of $\Phi(G)$ suggests in practice either to restrict oneself to walks of length 2 or 3, or to compress the representation $\Phi(G)$. The latter approach is widely used in chemoinformatics because fragments of length 5-10 are known to provide useful information in many structure-property relationship problems. The solution most often encountered is to use a hash table of limited size (typically 1024 or 2048) to map the vector $\Phi(G)$ onto a vector of smaller dimension, called a molecular fingerprint (Gasteiger and Engel, 2003). An obvious drawback of this solution is the danger of clashes, i.e., the mapping of different labels to the same position in the hashed vector.
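The label counts quoted above follow from the fact that a walk label alternates $n$ vertex labels with $n - 1$ edge labels. The short sketch below reproduces this arithmetic; the function name is illustrative.

```python
def number_of_walk_labels(n, atom_types=6, bond_types=3):
    """Number of possible labels for walks of length n: atom_types^n * bond_types^(n-1)."""
    return atom_types ** n * bond_types ** (n - 1)


label_counts = {n: number_of_walk_labels(n) for n in (3, 4, 5, 8)}
# label_counts == {3: 1944, 4: 34992, 5: 629856, 8: 3673320192}
```

Since a fingerprint of 1024 or 2048 positions is far smaller than any of these label sets beyond length 3, distinct labels must share positions as soon as enough of them occur in the data.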

An alternative solution for the use of large $n$ values is to use kernels. As we now show, kernels allow the estimation of linear models for vectors $\Phi(G)$ without reducing their dimension or requiring the computation and storage of the vectors. Indeed, remembering from Section 1 that SVM only need the definition of the inner product between vectors to estimate a linear model, we only need to show how the inner product for the vector representation (8) can be computed efficiently. For that purpose, let us write this inner product more explicitly for any two graphs $G$ and $G'$:

$$\begin{aligned}
\langle \Phi(G), \Phi(G') \rangle &= \sum_{l \in \mathcal{L}_n} \Phi_l(G) \Phi_l(G') \\
&= \sum_{l \in \mathcal{L}_n} \left( \sum_{w \in W_n(G)} \mathbf{1}(l(w) = l) \right) \left( \sum_{w' \in W_n(G')} \mathbf{1}(l(w') = l) \right) \\
&= \sum_{w \in W_n(G)} \sum_{w' \in W_n(G')} \left( \sum_{l \in \mathcal{L}_n} \mathbf{1}(l(w) = l) \, \mathbf{1}(l(w') = l) \right) \\
&= \sum_{w \in W_n(G)} \sum_{w' \in W_n(G')} \mathbf{1}\left( l(w) = l(w') \right). \qquad (9)
\end{aligned}$$

In other words the inner product between $\Phi(G)$ and $\Phi(G')$ can be expressed exactly as the number of pairs of walks $(w, w')$ of length $n$, respectively in $G$ and $G'$, with the same label. In order to show how this number can be computed efficiently, it is useful to introduce the product graph $G \times G'$, which is a graph whose vertices are pairs of vertices of $G$ and $G'$ with the same label, and whose edges connect pairs of vertices which are connected both in $G$ and $G'$ (Figure 4). In other words the vertices of $G \times G'$ are the pairs $(v, v') \in V \times V'$ with $l(v) = l(v')$, and there is an edge between $(v_1, v_1')$ and $(v_2, v_2')$ if and only if there is both an edge between $v_1$ and $v_2$ in $G$ and an edge between $v_1'$ and $v_2'$ in $G'$, and if both edges have the same label. It is easy to see, then, that a walk in the product graph is a sequence of pairs of vertices $(v, v')$, in $G$ and $G'$, that are connected in $G \times G'$ and therefore in $G$ and $G'$. Moreover both sequences of vertices in $G$ and $G'$ are made of pairs of vertices and pairs of edges with the same label, i.e., they form a pair of walks in $G$ and $G'$ with the same label. Conversely, given any walks $w$ in $G$ and $w'$ in $G'$ with the same label $l(w) = l(w')$, there is a walk in the product graph that corresponds to the pair of walks $(w, w')$. In other words, there is a bijection between the pairs of walks in $G$ and $G'$ with the same label, on the one hand, and the walks on $G \times G'$, on the other hand. Hence counting the number of pairs of walks of length $n$ on $G$ and $G'$ with the same label is equivalent to simply counting the number of walks of length $n$ on $G \times G'$, as illustrated in Figure 4. It turns out that the number of walks of length $n$ on a general graph (and in particular on a product graph for our purpose) can be easily computed by a recursion over $n$. Indeed, for a general graph, if we denote by $A_i(v)$ the number of walks of length $i$ starting at vertex $v$, then $A_1(v) = 1$ for any vertex $v$ and the following recursion formula holds:

$$A_{i+1}(v) = \sum_{u \sim v} A_i(u), \qquad (10)$$

where the sum is over the neighbor vertices $u$ of $v$. $A_n(v)$ can therefore be computed for any $v \in V$ by applying this formula recursively over $i$. The number of walks of length $n$ on the graph is then simply obtained by summing $A_n(v)$ over the vertices $v$. We observe that if we denote by $A$ the adjacency matrix of the graph and by $\mathbf{1}$ the vector whose entries are all equal to 1, then (10) simply expresses $A^{i}\mathbf{1}$ as $A \times A^{i-1}\mathbf{1}$, and the count of walks of length $n$ is equal to $\mathbf{1}^\top A^{n-1} \mathbf{1}$.

Figure 3: The 2D structure of a molecule (on the left) can be represented by a labeled graph (on the right). Two walks on the graph are illustrated, together with their label and length.

Figure 4: The product graph of two graphs (on the left) is obtained by considering all pairs of vertices with similar labels as vertices, and connecting two such vertices when the respective pairs of vertices in the initial graphs are connected (on the right). Each walk in the product graph (e.g., (2, 1') - (3, 2')) is associated to a pair of walks in the initial graphs with the same labels (e.g., 2 - 3 and 1' - 2'), and vice versa.

To summarize, we have shown that for the vector representation (8), the inner product between two graphs $G$ and $G'$ representing the 2D structures of two molecules can be computed by (i) constructing the adjacency matrix $A$ of the product graph $G \times G'$ and (ii) computing $\mathbf{1}^\top A^{n-1} \mathbf{1}$ using the recursion (10). This computation is exact and efficient, although the dimension of the vectors can reach the billions. In particular, the complexity of the computation increases only linearly with $n$, while the number of features increases exponentially. Using this inner product with a kernel method for pattern recognition or regression makes it possible to estimate a linear model in this space without ever computing or storing any vector.
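The two steps summarized above can be sketched as follows. The graph encoding (a dictionary of vertex labels and a dictionary of labeled undirected edges) and the function name are assumptions made for this example; the product graph is built by brute force over all vertex pairs, which is adequate for small molecules but is not an optimized implementation.

```python
def walk_kernel(labels1, edges1, labels2, edges2, n):
    """Number of pairs of walks of length n (n vertices) with identical labels.

    Computed as the number of walks of length n in the product graph,
    using the recursion A_{i+1}(v) = sum of A_i(u) over the neighbors u of v.
    """
    def symmetric(edges):
        # Store each undirected edge in both directions.
        both = {}
        for (a, b), bond in edges.items():
            both[(a, b)] = bond
            both[(b, a)] = bond
        return both

    e1, e2 = symmetric(edges1), symmetric(edges2)
    # Vertices of the product graph: pairs of vertices with the same label.
    nodes = [(u, v) for u in labels1 for v in labels2 if labels1[u] == labels2[v]]
    # Edges of the product graph: both graphs have an edge, with the same label.
    neighbors = {
        p: [q for q in nodes
            if (p[0], q[0]) in e1 and (p[1], q[1]) in e2
            and e1[(p[0], q[0])] == e2[(p[1], q[1])]]
        for p in nodes
    }
    counts = {p: 1 for p in nodes}            # walks of length 1
    for _ in range(n - 1):
        counts = {p: sum(counts[q] for q in neighbors[p]) for p in nodes}
    return sum(counts.values())


# Toy skeletons: C-C-O and C-C, single bonds only.
g_atoms, g_bonds = {0: "C", 1: "C", 2: "O"}, {(0, 1): "-", (1, 2): "-"}
h_atoms, h_bonds = {0: "C", 1: "C"}, {(0, 1): "-"}
k_gh = walk_kernel(g_atoms, g_bonds, h_atoms, h_bonds, 2)
k_gg = walk_kernel(g_atoms, g_bonds, g_atoms, g_bonds, 3)
```

For the C-C-O skeleton compared with itself and $n = 3$, the explicit walk counts by label are (2, 1, 1, 1, 1), whose squared norm is 8, the value returned in `k_gg` without listing any walk.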



3 2D kernel extensions 

The kernel for 2D structures presented in the previous section to illustrate the power of kernels can be used as such, but many extensions have been proposed to increase the flexibility and the expressiveness of the representation. In this section we review some of these extensions.



3.1 Walks of various lengths 

In the computation of the kernel based on walks of length $n$, we note that the kernels for walks of length $i < n$ are computed as intermediate results. The choice of $n$ is arbitrary in practice and should depend on the targeted application and the data available. Alternatively we may decide not to choose a particular value of $n$, but to combine walks of different lengths in a joint feature model. The inner product being additive when new features are added, the kernel corresponding to the feature space in which all walks of length up to $n$ are considered is the sum of the kernels corresponding to walks of each fixed length up to $n$. The complexity of the computation is barely increased for this extension: instead of performing the recursion (10) $n$ times before summing the terms, one just needs to increase a counter by the sum of the terms at each iteration.
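The running-counter idea can be sketched as below. Here `neighbors` stands for the adjacency lists of a product graph $G \times G'$ that has already been built; both names are hypothetical and the small graph is made up for illustration.

```python
def walk_counts_by_length(neighbors, n):
    """Return [c_1, ..., c_n], where c_i is the number of walks of length i."""
    counts = {v: 1 for v in neighbors}                 # A_1(v) = 1
    totals = [sum(counts.values())]
    for _ in range(n - 1):
        # One step of the recursion A_{i+1}(v) = sum over neighbors u of A_i(u).
        counts = {v: sum(counts[u] for u in neighbors[v]) for v in neighbors}
        totals.append(sum(counts.values()))
    return totals


def walk_kernel_up_to(neighbors, n):
    """Kernel for all walks of length at most n: sum of the fixed-length kernels."""
    return sum(walk_counts_by_length(neighbors, n))


# Made-up product graph: a path on three vertices.
path = {0: [1], 1: [0, 2], 2: [1]}
per_length = walk_counts_by_length(path, 3)
up_to_three = walk_kernel_up_to(path, 3)
```

The recursion is run once; each intermediate total is a fixed-length kernel, so the combined kernel costs no more iterations than the kernel for length $n$ alone.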

When $n$ increases the inner product in this "until-n" extension grows exponentially with $n$ and diverges. A solution if one wishes to use large values for $n$, and even infinite $n$ to be able to include all walks, is to weight the contributions of different walks by a factor $\lambda(w)$ that will ensure convergence of the series, i.e., to consider the following kernel:

$$k(G, G') = \sum_{n=1}^{\infty} \sum_{w \in W_n(G)} \sum_{w' \in W_n(G')} \lambda(w) \lambda(w') \, \mathbf{1}\left( l(w) = l(w') \right). \qquad (11)$$



As an example, Gärtner (2002) proposed to weight the contribution of walks of length $i$ in the inner product by a factor $\beta^i$, i.e., to consider the formula:

$$k(G, G') = \sum_{n=1}^{\infty} \sum_{w \in W_n(G)} \sum_{w' \in W_n(G')} \beta^n \, \mathbf{1}\left( l(w) = l(w') \right) = \sum_{n=1}^{\infty} \beta^n k_n(G, G'),$$

where $k_n$ denotes the kernel based on the count of walks of length exactly $n$. Remembering from the previous section that $k_n(G, G')$ is equal to $\mathbf{1}^\top A^{n-1} \mathbf{1}$, where $A$ is the adjacency matrix of $G \times G'$, we can rewrite and factorize this kernel as follows for $\beta$ small enough (smaller than the inverse of the spectral radius of $A$):

$$k(G, G') = \sum_{n=1}^{\infty} \beta^n \, \mathbf{1}^\top A^{n-1} \mathbf{1} = \beta \, \mathbf{1}^\top \left( \sum_{n=0}^{\infty} \beta^n A^n \right) \mathbf{1} = \beta \, \mathbf{1}^\top (I - \beta A)^{-1} \mathbf{1}.$$

Hence the computation of the inner product in the infinite-dimensional space of all walk counts can be performed explicitly, at the cost of inverting the sparse matrix $I - \beta A$. In practice the first terms of the power series expansion provide a fast and good approximation to the complete kernel, and allow more flexibility in the weighting of the walks of different length.
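The closed form and its truncated power series can be compared numerically. The sketch below assumes the closed form $\beta \, \mathbf{1}^\top (I - \beta A)^{-1} \mathbf{1}$ and uses a small dense Gaussian elimination written for this example, rather than the sparse solver a practical implementation would use; all function names are illustrative.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def geometric_walk_kernel(adjacency, beta):
    """Closed form beta * 1^T (I - beta*A)^{-1} 1 for a dense adjacency matrix A."""
    n = len(adjacency)
    system = [[(1.0 if i == j else 0.0) - beta * adjacency[i][j] for j in range(n)]
              for i in range(n)]
    return beta * sum(solve(system, [1.0] * n))


def truncated_walk_kernel(adjacency, beta, terms):
    """First `terms` terms of sum_{n>=1} beta^n * 1^T A^(n-1) 1."""
    n = len(adjacency)
    vector = [1.0] * n                      # A^0 1
    total, weight = 0.0, beta
    for _ in range(terms):
        total += weight * sum(vector)
        vector = [sum(adjacency[i][j] * vector[j] for j in range(n)) for i in range(n)]
        weight *= beta
    return total


# Made-up product graph: a path on three vertices (spectral radius sqrt(2)).
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
exact = geometric_walk_kernel(A, 0.1)
approx = truncated_walk_kernel(A, 0.1, 60)
```

With $\beta = 0.1$ the series converges quickly on this graph, and a handful of terms already agrees with the closed form (which equals 17/49 here) to several decimal places.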



Another weighting scheme for walks has been proposed independently by Kashima et al. (2003), who propose to define Markov random walks on each graph and to weight the occurrence of each walk on a graph by its probability under the corresponding random walk model. As for the exponential decay, the random walk weighting scheme factorizes along the walks and can be computed with the same tricks as the exponential decay walk kernel.



3.2 Filtering tottering walks 



In the previous section, we did not make any restriction on the definition of walks: they are simply defined as successions of connected graph vertices. Because molecular graphs are essentially undirected, this generic definition allows walks to have an erratic behaviour, which can in turn introduce misleading information about the true structure of the graph into the kernel. Indeed, arbitrarily long walks can for instance be generated by simply alternating between two connected vertices. A natural way to increase the expressive power of walks with respect to the structure of the graphs is to prevent vertices from appearing more than once in a walk. In the terminology of graph theory, this corresponds to defining a kernel based on common paths instead of common walks. Albeit very natural, this extension unfortunately renders the kernel computation intractable, as pointed out by Gärtner et al. (2003).

A computationally efficient alternative proposed by Mahé et al. (2005) is to disregard the tottering walks in the enumeration of walks. As illustrated in Figure 5, a tottering walk is a walk that comes back to a vertex it has just left. Although the notion of path is stronger than the notion of non-tottering walk for general graphs, they are equivalent on graphs without cycles. The relevance of the concept of tottering walks stems from computational advantages: as shown by Mahé et al. (2005), the set of non-tottering walks of a graph $G$ corresponds to a set of walks of a transformed graph $t(G)$, where the transformation $t$ involves adding additional vertices and edges. As a result, the kernel for two graphs $G$ and $G'$ based on non-tottering walks only is easily computed as the standard walk kernel between the transformed graphs $t(G)$ and $t(G')$. More details about this transformation can be found in Mahé et al. (2005).

Figure 5: Tottering (red) and non-tottering (blue) walks. Both walks are labeled as a succession of 3 carbon atoms, but only the blue one involves 3 distinct atoms.
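To illustrate the definition, the sketch below counts the non-tottering walks of a single unlabeled graph by remembering, for each partial walk, the vertex it has just left. This is only one simple way to enforce the constraint, written for this example; it is not the graph transformation $t$ of Mahé et al. (2005), and the function name is hypothetical.

```python
def count_non_tottering_walks(neighbors, n):
    """Number of walks of length n (n vertices) that never return to the vertex just left."""
    if n == 1:
        return len(neighbors)
    # State (previous, current) -> number of non-tottering walks ending with that step.
    counts = {(u, v): 1 for u in neighbors for v in neighbors[u]}
    for _ in range(n - 2):
        extended = {}
        for (previous, current), c in counts.items():
            for nxt in neighbors[current]:
                if nxt != previous:                    # forbid immediate backtracking
                    key = (current, nxt)
                    extended[key] = extended.get(key, 0) + c
        counts = extended
    return sum(counts.values())


path = {0: [1], 1: [0, 2], 2: [1]}            # acyclic: non-tottering walks are paths
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}  # cyclic: a walk may revisit a vertex later
on_path = count_non_tottering_walks(path, 3)
on_triangle = count_non_tottering_walks(triangle, 4)
```

On the path, only the two traversals 0-1-2 and 2-1-0 remain out of six walks of length 3. On the triangle, the six non-tottering walks of length 4 all return to their starting vertex, which shows that non-tottering walks are not necessarily paths once the graph has a cycle.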



3.3 Increasing the expressiveness of walks 

A second criticism that can be made of walk kernels is the fact that, because of their linearity, walks bear limited information about the structure of a graph. A principled way to address this issue, which is actually the topic of the next subsection, is to introduce subgraphs of a higher level of complexity in the kernel construction. In practice, however, this approach usually raises additional complexity issues that can be hard to circumvent. A simpler alternative is to keep a walk-based characterization of graphs and introduce some form of prior knowledge in the graph labeling function, in order to enrich the information brought by walks about the graph structure. This is in particular the approach taken in Mahé et al. (2005), where a new set of labels is defined for the vertices of a graph, based on the local environment of the atoms in the corresponding molecule. This method relies on a topological index, called the Morgan index, which is defined for each atom of the molecule according to the following iterative procedure. Initially, the index associated to every vertex is equal to 1. Then, at each step, the index of a vertex is defined as the sum of the indices associated to its neighbors at the previous iteration. This process is straightforward to implement in practice, since if we let $M_i$ be the vector of Morgan indices computed at the $i$-th iteration, it reads as $M_0 = \mathbf{1}$ and $M_{i+1} = A M_i$, where $\mathbf{1}$ is the vector whose entries are all equal to 1 and $A$ is the adjacency matrix of the graph.

As illustrated in Figure 6, Morgan indices make it possible to distinguish between atoms having the same type but different topological properties. When they are included in the labels of the vertices, these indices therefore define a walk as a sequence of atoms taken in a particular topological configuration. In practice, the advantage of this refinement is twofold. First, the introduction of topological information in walk labels enriches the information they bear with respect to the structure of the graphs to be compared. Second, because atoms are made more specific to the graph they belong to, as illustrated in Figure 6, the number of identically labeled atoms found in a pair of graphs automatically decreases, which has the effect of reducing the size of their product graph, hence the time needed to compute the kernel. Note that this computational advantage is, perhaps surprisingly, due to the increase in dimension of the feature space. We note however that while this Morgan process systematically reduces the cost of computing the kernel, performing too many iterations makes it impossible to detect common walks within a pair of graphs.
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No Morgan Indices O 1 



Order 1 indices 



,03 



Figure 6: Illustration of the Morgan process. Initially, all atoms of the cycle are seen as identical. For increasing 
iterations, the presence of the N02 branch is more and more reflected in the atoms of the cycle. 



3.4 Subtree kernels 



As mentioned in the previous subsection, the linear nature of walks limits their ability to properly encode the
structure of a graph. This fact is emphasized by Ramon and Gärtner (2003), who show that graphs can be
structurally different yet have the same walk content, which makes them indistinguishable by a kernel based
on the count of common walks. Figure 7 illustrates this issue on a simple example. On the other hand, they
also show that computing a perfect graph kernel, that is, a kernel mapping non-isomorphic graphs to distinct
points in the feature space, is at least as hard as solving the graph isomorphism problem, for which there is no
known polynomial-time algorithm. This suggests that the expressiveness of graph kernels must be traded off
against their computational complexity.



As a first step towards a refinement of the feature space used in walk-based graph kernels, Ramon and Gärtner
(2003) introduce a kernel function comparing graphs on the basis of their common subtrees. As illustrated in
Figure 8, this representation looks particularly promising for molecules, since it captures in a principled
way a wide range of functional features of molecules, which typically correspond to specific branching patterns on
their associated graphs. On the practical side, kernels of this type can be computed by means of dynamic
programming algorithms that recursively detect and extend identical neighborhood properties within the vertices of
the graphs to be compared, in order to explicitly build their set of common subtrees. The relative contribution
of subtrees of different sizes is typically controlled by means of a parameter playing a similar role to that of
the parameter $\beta$ in Equation (12). These algorithms have a prohibitive complexity in general, but they can be
deployed for molecular graphs where, because of valence rules, the degree of the vertices is small on average.
The relevance of this class of kernels, as well as its relationship with standard walk-based kernels, has been
analyzed in detail in Mahé and Vert (2006).





Figure 7: Two graphs having the same walk content, namely • : x5 ; •— » : x4 and • • • : x2, and consequently
mapped to the same point of the feature space corresponding to a kernel based on the count of walks (Gärtner et al.,
2003).

Figure 8: Illustration of the tree-structured fragment representation: a graph G (left) and an extract of its feature
space representation $\phi(G)$ (right). Note that the green tree corresponds to a walk structure.



4 2D kernels in practice 

As a conclusion about kernels for 2D structures, we now discuss several issues related to their application in 
practice. 



4.1 Implementation and complexity issues 

As mentioned in Section 2, an elegant way to compute walk-based kernels lies in the product graph formalism
initially introduced by Gärtner et al. (2003). The basic idea of the product graph construction is to merge the
pair of graphs to be compared into a single graph, in such a way that a bijection is defined between the set
of walks of the product graph and the set of common walks of the two initial graphs. It then follows that the
number of walks of a given length occurring at the same time in the two graphs can be obtained by simple matrix
products, which actually offers a closed-form solution to the computation of kernels based on walks of infinite
length for well-chosen walk weighting schemes (Gärtner et al., 2003; Kashima et al., 2004). As a result, even
though the dimensionality associated to these kernels can be very large, and actually infinite, computing these
kernels under the product graph formalism has a polynomial complexity with respect to the product of the sizes
of the graphs to be compared^1.
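To make the product graph construction more concrete, the following Python sketch builds the product graph of two small vertex-labeled graphs and counts their common walks by repeated matrix-vector products. It is a simplified illustration under stated assumptions: edge labels are ignored, walks are truncated at a maximal length instead of using the closed-form infinite sum, and the geometric weighting with the parameter `decay` as well as all function names are hypothetical choices made for the example.

```python
def product_graph(labels1, edges1, labels2, edges2):
    """Adjacency matrix of the product graph of two vertex-labeled graphs.

    Vertices are pairs (u, v) of identically labeled vertices; two pairs are
    adjacent when their first components are adjacent in graph 1 and their
    second components are adjacent in graph 2. Edge labels are ignored here.
    """
    adj1 = set(edges1) | {(b, a) for a, b in edges1}
    adj2 = set(edges2) | {(b, a) for a, b in edges2}
    pairs = [(u, v) for u in range(len(labels1)) for v in range(len(labels2))
             if labels1[u] == labels2[v]]
    return [[1 if (u, u2) in adj1 and (v, v2) in adj2 else 0
             for (u2, v2) in pairs] for (u, v) in pairs]


def count_common_walks(A, length):
    """Number of walks with `length` edges in the product graph: 1^T A^length 1."""
    counts = [1] * len(A)
    for _ in range(length):
        counts = [sum(row[j] * counts[j] for j in range(len(A))) for row in A]
    return sum(counts)


def walk_kernel(A, max_length, decay):
    """Weighted count of common walks up to max_length (geometric weights, assumed)."""
    return sum(decay ** n * count_common_walks(A, n) for n in range(max_length + 1))


# Toy example: C-C-O compared with C-O
A = product_graph(["C", "C", "O"], [(0, 1), (1, 2)], ["C", "O"], [(0, 1)])
k = walk_kernel(A, max_length=2, decay=0.5)
```

Each walk of the product graph corresponds to one pair of identically labeled walks in the two input graphs, so counting walks in the product graph counts common walks.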

In practice, product-graph implementations of this type remain time consuming, even for relatively small
graphs, which questions the suitability of these kernels for virtual screening applications, which typically involve
large datasets of molecules. However, if only walks up to a given length are considered, which usually makes
sense for real-world applications, fast algorithms can be used to compute walk kernels, based for instance on
trie-tree structures and string kernel algorithms (Leslie et al., 2002; Shawe-Taylor and Cristianini, 2004), or
standard depth-first search procedures (Ralaivola et al., 2005). Moreover, alternative implementations that
drastically reduce the time needed to compute such kernels in their general form have recently been
proposed (Vishwanathan et al., 2007).



4.2 Kernel normalization 

A potential drawback of kernels comparing structured objects by means of their substructures lies in the fact 
that kernel values are highly dependent on the size of the objects to be compared. Indeed, big objects tend to be 
granted a higher degree of similarity than small objects for the only reason that they are made of a larger number 
of substructures. This fact can lead to a serious bias of the subsequent prediction model, and the classical way
to tackle this issue is to apply a normalization operation in order to take into account the size of the objects in
the value of the kernel function. In practice, the mainstream normalization scheme is given by the following
expression:

$$\tilde{k}(x,y) = \frac{k(x,y)}{\sqrt{k(x,x)\,k(y,y)}},$$

where $k$ is the original kernel, and $\tilde{k}$ its normalized value. Note that this normalization operation has the effect
of setting the diagonal of the kernel matrix to one, meaning that individual objects are given the same degree
of self-similarity, whatever their size. Geometrically, this amounts to scaling all vectors to unit norm before
taking their inner product.

^1 More precisely, the worst-case complexity is cubic with respect to the product of the sizes of the graphs.
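As a small illustration of this normalization, the sketch below applies it to a precomputed kernel matrix stored as a list of lists. The function name `normalize_kernel_matrix` and the toy values are assumptions made for the example.

```python
import math


def normalize_kernel_matrix(K):
    """Return the matrix with entries K[i][j] / sqrt(K[i][i] * K[j][j])."""
    n = len(K)
    return [[K[i][j] / math.sqrt(K[i][i] * K[j][j]) for j in range(n)]
            for i in range(n)]


# Toy kernel matrix: the second object is "bigger" (larger self-similarity)
K = [[4.0, 2.0],
     [2.0, 9.0]]
K_norm = normalize_kernel_matrix(K)
```

After normalization the diagonal is equal to one, so the larger object no longer dominates simply because it contains more substructures.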

In the context of molecular graph kernels, alternative normalization schemes based on the Tanimoto similarity
coefficient have recently been introduced (Ralaivola et al., 2005; Swamidass et al., 2005). The Tanimoto
coefficient is widely used in chemoinformatics to assess the similarity of molecular fingerprints. For a pair of
fingerprints $(A, B)$, it is defined as:

$$T_{A,B} = \frac{A^\top B}{A^\top A + B^\top B - A^\top B}.$$

For binary fingerprints, it can be seen as the ratio between their intersection, that is, the number of bits set to
one in both fingerprints, and their union. As pointed out by Ralaivola et al. (2005), since it is based on inner
product operations, this coefficient can be generalized to any kernel function, leading to the notion of Tanimoto
kernel, defined for a kernel $k$ as:

$$k^{T}(x,y) = \frac{k(x,y)}{k(x,x) + k(y,y) - k(x,y)}.$$

This transformation provides an alternative way to normalize kernel functions in the sense that $k^{T}(x,x) = 1$
for all $x$. Several variations on this idea, which generalize the classical Tanimoto coefficient in different
ways, are proposed in Ralaivola et al. (2005) and Swamidass et al. (2005).
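The Tanimoto construction can be illustrated with a short Python sketch. The helper names `dot` and `tanimoto` and the toy fingerprints are assumptions made for the example; any kernel function taking two arguments can be passed in place of the plain inner product.

```python
def dot(a, b):
    """Plain inner product between two fingerprints given as lists of numbers."""
    return sum(x * y for x, y in zip(a, b))


def tanimoto(x, y, kernel=dot):
    """Tanimoto normalization of an arbitrary kernel function."""
    kxy = kernel(x, y)
    return kxy / (kernel(x, x) + kernel(y, y) - kxy)


# Two binary fingerprints sharing 2 bits, with 4 bits set in their union
fp_a = [1, 1, 0, 1]
fp_b = [1, 0, 1, 1]
similarity = tanimoto(fp_a, fp_b)
```

With binary fingerprints and the inner product as kernel, the value is the ratio of the intersection (2 bits) to the union (4 bits).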



4.3 Kernel parameterization 

Last but not least comes the issue of kernel parameterization. This question is of tremendous importance since
a bad parameterization can seriously impair the success of the subsequent virtual screening application. First,
one must choose between the kernel based on walks and the kernel based on subtrees. For now, this
question remains largely open since, apart from the study of Mahé and Vert (2006), the relevance of subtrees
in graph kernels has not been studied in detail. While the preliminary results presented in this study suggest
that subtree kernels may indeed improve over their walk-based counterparts, they also show that this class of
substructures raises additional issues, related in particular to the computational complexity of the kernels as well
as the explosion in the number of subtrees found in the graphs.

Concerning the parameterization of walk kernels, the main issue concerns the length(s) of the walks to
consider: either walks of a precise length, up to a maximal (but finite) length, or even up to infinite length. In
practice, this question is highly dependent on the problem considered. Optimally choosing this parameter can
therefore hardly be done a priori but involves cross-validation procedures. Although focusing on walks of a
precise length can be optimal in some cases^2, a safe default choice is to consider walks of length up to a limited
value, taken around 8 or 10. Actually, because kernels based on an infinite number of walks require
down-weighting the contribution of walks depending on their length (as in Equation (12) for instance), long walks
are in practice so penalized that their individual contribution is barely taken into account in the kernel. Explicitly
limiting the length of the walks to be taken into account therefore makes sense in practice. Moreover,
considering a finite number of walks provides a greater flexibility in the way to control their relative contribution in
the kernel, and offers the practical advantage of paving the way to the deployment of computationally cheaper
algorithms, as discussed in Section 4.1. A second important issue is related to kernel normalization. Although
the impact of choosing the first or the second normalization scheme introduced in Section 4.2 has not been
analysed in detail, Tanimoto kernels led to good results in several validation studies (Ralaivola et al., 2005;
Swamidass et al., 2005; Azencott et al., 2007). Finally, one may consider further refinements such as filtering
tottering walks and introducing Morgan indices. As shown in Mahé et al. (2005), Morgan indices of a limited
order, typically obtained at the 2nd or 3rd iteration of the process, can indeed improve virtual screening models
while reducing their computational cost^3. Filtering tottering walks should however be done with caution.
Indeed, as shown in Mahé and Vert (2006), while this can improve the models in some cases, it seems that
the tottering phenomenon can also be helpful to detect similarity between structurally different compounds.

^2 For instance, walks of length 6 or 7 can be optimal to characterize molecules mainly made of aromatic cycles.

Figure 9: Left: the molecule of flavone. Right: a pharmacophore made of one hydrogen bond acceptor (topmost
sphere) and two aromatic rings, with distances $d_1$, $d_2$ and $d_3$ between the features that can be extracted from its
structure, as shown in the middle.

5 A 3D pharmacophore kernel 

Motivated by the fact that the tridimensional structure of molecules has a central role in many biological
mechanisms, including drug-target interactions for instance, recent attempts have been made to develop
kernels for the 3D structure of molecules. In this section, we introduce a class of kernels that relies on the notion of
pharmacophore, which is widely used in chemoinformatics. A pharmacophore is usually defined as a spatial
arrangement of three to four atoms^4 responsible for the biological activity of a drug molecule. In the following,
we focus on three-point pharmacophores composed of three atoms, whose arrangement therefore forms a
triangle in the 3D space (Figure 9), but similar ideas naturally apply to pharmacophores of different cardinalities^5.
With a slight abuse of terminology, we refer below as pharmacophore to any possible configuration of three atoms
arranged as a triangle and present in a molecule, representing therefore a putative configuration responsible for the
biological property of interest. More precisely, we consider a molecule $m$ as a set of atoms in the 3D space, that is:

$$m = \{(x_i, l_i) \in \mathbb{R}^3 \times \mathcal{L}\}_{i=1,\dots,|m|},$$

where $|m|$ is the number of atoms that compose the molecule, and $(x_i, l_i) \in \mathbb{R}^3 \times \mathcal{L}$ stands for its i-th atom,
$x_i$ being its vector of (x,y,z) coordinates, and $l_i$ its label, such as its type for instance, but more generally taken
from a set $\mathcal{L}$ of atom labels. With these notations at hand, the set of three-point pharmacophores that can be
extracted from the molecule $m$ can be formally defined as:

$$\mathcal{P}(m) = \{(p_1, p_2, p_3) \in m^3,\; p_1 \neq p_2,\; p_1 \neq p_3,\; p_2 \neq p_3\}.$$

^3 Actually, this is only true for product-graph implementations. For trie-tree implementations, Morgan indices have the
opposite effect of increasing the cost of computing the kernel.

^4 More generally, pharmacophores are defined as arrangements of groups of atoms having particular properties, such as
positive or negative polarity, high hydrophobicity, and so on.

^5 In particular, similar ideas were developed in Swamidass et al. (2005) based on two-point pharmacophores, that is to say,
distances between pairs of atoms.

Following our discussion of Section 2, a simple way to represent a molecule $m$ is to extract all its
pharmacophores, sort them by type, and count in a vector $\Phi(m)$ the number of pharmacophores of each possible type.
Clearly, the number of pharmacophores associated to a molecule is finite, but since their definition is based on
the precise (x,y,z) coordinates of the atoms they are made of, or equivalently on continuous inter-atomic distances,
the space of all possible pharmacophores is infinite. Defining such a vector representation therefore requires in
practice discretizing the space of pharmacophores, which boils down to discretizing the range of inter-atomic
distances into a pre-defined number of bins. Formally, if we consider $n$ bins in the discretization, this operation
defines a space of discrete pharmacophores $\mathcal{T} = \mathcal{L}^3 \times [1,n]^3$, where each pharmacophore corresponds to a
triplet of atom labels, taken from the alphabet $\mathcal{L}$, and a triplet of distance bin indices, taken in $[1,n]$. We can
now define the vector representation $\Phi(m)$, in which each coordinate $\Phi_t(m)$ is the number of pharmacophores
extracted from the molecule $m$ that correspond to the discrete pharmacophore $t$, that is:

$$\Phi_t(m) = \sum_{p \in \mathcal{P}(m)} \mathbb{1}(\mathrm{disc}(p) = t),$$

where the function $\mathbb{1}(\mathrm{disc}(p) = t)$ is one if the discretized version of the pharmacophore $p$ is $t$, meaning that
they are based on the same triplet of atom labels and the same triplet of distance bins, and zero otherwise.
The number $n$ of bins considered in the discretization specifies the resolution at which distinct pharmacophores
are considered to be equivalent, and constitutes a critical parameterization issue. Indeed, small distance bins
may prevent the detection of similar pharmacophores, while large distance bins can lead to a matching between
unrelated pharmacophores. In practice, this parameter also defines the dimension of $\Phi(m)$. For example,
considering 6 distinct types of atoms and 10 distance bins, which corresponds to bins of 2 angstroms if pairs of
atoms are considered to lie within the 0-20 angstrom distance range, the cardinality of $\mathcal{T}$, hence the dimension of
$\Phi(m)$, is $6^3 \times 10^3 = 216,000$. This number is raised to $6^3 \times 20^3 = 1,728,000$ in order to reach a precision of
1 angstrom per inter-atomic distance bin. This explosion in the number of dimensions suggests again that in order to
explicitly store the vector $\Phi(m)$, one should either consider a limited number of bins, thereby considering a poor
resolution to characterize molecular structures, or rely on hashing algorithms to map the vector $\Phi(m)$ onto a vector
of limited size, which as discussed previously, has the effect of inducing clashes between distinct pharmacophores. This
representation highlights once again the benefit of using kernel functions since, following the lines of Equation
(9), one can define the kernel:

$$k(m,m') = \Phi(m)^\top \Phi(m') = \sum_{p \in \mathcal{P}(m)} \sum_{p' \in \mathcal{P}(m')} \mathbb{1}(\mathrm{disc}(p) = \mathrm{disc}(p')), \quad (14)$$

which, as will be discussed in Section 6.1, makes it possible to map pairs of molecules and compute their inner
product in feature spaces indexed by millions of pharmacophores, for a computational complexity that remains
polynomial with respect to the product of their sizes.
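The discretized representation and the kernel of Equation (14) can be sketched as follows. This is a naive illustration, not an efficient implementation: it enumerates all ordered triplets of atoms explicitly, assumes equal-width bins over a 0-20 angstrom range, and stores only the non-zero coordinates of the feature vector. All function names, the binning rule and the toy molecules are assumptions made for the example.

```python
import math
from collections import Counter
from itertools import permutations


def feature_space_dimension(n_labels, n_bins):
    """Cardinality of the set of discrete three-point pharmacophores."""
    return n_labels ** 3 * n_bins ** 3


def discretize(p, n_bins, max_dist=20.0):
    """Map a pharmacophore (three (coords, label) atoms) to (labels, distance bins)."""
    width = max_dist / n_bins
    labels = tuple(label for _, label in p)
    # Bin indices in [1, n_bins] for the distances between atoms i and i+1 (modulo 3)
    bins = tuple(
        min(int(math.dist(p[i][0], p[(i + 1) % 3][0]) / width), n_bins - 1) + 1
        for i in range(3)
    )
    return labels, bins


def feature_vector(molecule, n_bins):
    """Sparse count vector: one entry per discrete pharmacophore present."""
    return Counter(discretize(p, n_bins) for p in permutations(molecule, 3))


def discrete_kernel(m1, m2, n_bins):
    """Inner product between the two sparse count vectors."""
    phi1, phi2 = feature_vector(m1, n_bins), feature_vector(m2, n_bins)
    return sum(count * phi2[t] for t, count in phi1.items())


# Toy molecules (hypothetical coordinates, in angstroms)
m1 = [((0.0, 0.0, 0.0), "C"), ((1.5, 0.0, 0.0), "O"), ((0.0, 1.5, 0.0), "N")]
m2 = [((5.0, 5.0, 5.0), "C"), ((6.5, 5.0, 5.0), "O"), ((5.0, 6.5, 5.0), "N")]  # m1 translated
m3 = [((0.0, 0.0, 0.0), "C"), ((1.5, 0.0, 0.0), "O"), ((0.0, 4.5, 0.0), "N")]  # N moved away
```

Only the pharmacophores actually present in a molecule are stored, so the dimension of the full feature space never has to be materialized.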

Of course, the idea of representing a molecule by means of its pharmacophoric content is not new, and the
above approach bears strong similarity with well-known pharmacophore fingerprint representations (Brown and Martin,
1997; Matter and Pötter, 1999; McGregor and Muskal, 1999). The above discussion nevertheless illustrates the
interest of using kernel functions in this case, since they make it possible to compute exactly the inner products
between very high-dimensional feature vectors without the need to compute or store these vectors explicitly, an
operation which is not possible in general or comes at the price of an information loss. This is not, however, the
major improvement made possible by kernel functions in this context. Indeed, as shown in Figure 10, the main drawback
of this approach lies in the discretization of the pharmacophore space itself: not only does the choice of the
discretization step control the precision required to match a pair of pharmacophores, but it also prevents
pharmacophores falling on different sides of bin edges from being matched, although they can be very close, and
actually even closer than two pharmacophores falling in the same bin. The kernel approach circumvents this
discretization issue by means of a simple generalization of Equation (14), where the binary function checking whether
pairs of pharmacophores have the same discretized version or not is replaced by a general kernel between pharmacophores
in order to continuously quantify their similarity. Letting $k_P$ be such a kernel, this leads to the general 3D kernel
formulation:

$$k(m,m') = \sum_{p \in \mathcal{P}(m)} \sum_{p' \in \mathcal{P}(m')} k_P(p,p'), \quad (15)$$



which was introduced in Mahé et al. (2006). A meaningful kernel $k_P$ between pharmacophores should
intuitively quantify at the same time the similarity of the triplets of atoms from which the pair of pharmacophores
to be compared are defined, and the similarity of their spatial arrangement. A natural way to achieve this goal, which
is at the same time compatible with the algorithm implementing the kernel (15) (see Section 6.1), consists in factorizing
the kernel $k_P$ along the pairs of atoms and inter-atomic distances that define the pair of pharmacophores to be
compared. Mahé et al. (2006) suggest for instance to introduce elementary kernel functions
$k_{At} : \mathcal{L} \times \mathcal{L} \to \mathbb{R}$ and $k_{Dist} : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ comparing atoms and distances
respectively, and to define the kernel $k_P$ as:

$$k_P(p,p') = \prod_{i=1}^{3} k_{At}(l_i, l'_i) \times \prod_{i=1}^{3} k_{Dist}(\|x_i - x_{i+1}\|, \|x'_i - x'_{i+1}\|), \quad (16)$$

where the pharmacophore $p$ (resp. $p'$) is defined as $((x_i, l_i))_{i=1,\dots,3}$ (resp. $((x'_i, l'_i))_{i=1,\dots,3}$),
$\|\cdot\|$ denotes the Euclidean distance, and the index $i+1$ is taken modulo 3. In this approach, the task of defining a
kernel between 3D structures therefore boils down to defining a couple of kernels comparing atoms and inter-atomic distances.
These kernels intuitively define the elementary notions of similarity involved in the pharmacophore comparison,
which in turn define the overall similarity between molecules. A simple default choice for these kernels is to
define the atom kernel $k_{At}$ as a binary kernel simply checking whether the pair of atoms to be compared have
the same label or not, that is:

$$k_{At}(l, l') = \mathbb{1}(l = l'),$$

and to define the inter-atomic distance kernel $k_{Dist}$ as the following Gaussian radial basis function (RBF) kernel:

$$k_{Dist}(d, d') = \exp\left(-\frac{(d - d')^2}{2\sigma^2}\right),$$

where $\sigma$ is a bandwidth parameter. Under this parameterization, it is interesting to note that the continuous
kernel of Equation (15) and its discretized counterpart of Equation (14) share an important feature: because the
atom kernel $k_{At}$ is binary, both kernels are based on pairs of pharmacophores defined by the same triplets of atom
labels. The striking difference between the two formulations lies in the fact that in the kernel of Equation (15),
the strength of the pharmacophore matching is continuously controlled by the parameter $\sigma$ of the RBF kernel
comparing inter-atomic distances. Choosing a small value of $\sigma$ corresponds to imposing a strong constraint on
the spatial similarity of pharmacophores, while a larger value of $\sigma$ allows pairs of pharmacophores to be taken
into account in the kernel although their spatial configurations may differ.

Figure 10: Illustration of the discretization issue. $x_1$, $x_2$ and $x_3$ correspond to pharmacophores living in a
discretized bidimensional (Euclidean) space. $x_1$ is closer to $x_3$ than it is to $x_2$, yet the discretization assigns
$x_1$ and $x_2$ to the same bin and $x_3$ to another bin. The kernel of Equation (15) circumvents this issue.
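The continuous kernel of Equations (15) and (16) can be written down directly as a brute-force double sum, as in the sketch below. This is only meant to make the definition concrete: it enumerates all ordered triplets of atoms in both molecules, which is far more expensive than the algorithms discussed in Section 6.1. The function names, the exact scaling of the RBF kernel (here with $2\sigma^2$ in the denominator) and the toy molecules are assumptions made for the example.

```python
import math
from itertools import permutations


def k_atom(label1, label2):
    """Binary atom kernel: 1 if the labels are identical, 0 otherwise."""
    return 1.0 if label1 == label2 else 0.0


def k_dist(d1, d2, sigma):
    """Gaussian RBF kernel between two inter-atomic distances (assumed scaling)."""
    return math.exp(-((d1 - d2) ** 2) / (2.0 * sigma ** 2))


def k_pharmacophore(p, q, sigma):
    """Factorized kernel between two triplets of (coords, label) atoms."""
    value = 1.0
    for i in range(3):
        j = (i + 1) % 3  # index i+1 taken modulo 3
        value *= k_atom(p[i][1], q[i][1])
        value *= k_dist(math.dist(p[i][0], p[j][0]), math.dist(q[i][0], q[j][0]), sigma)
    return value


def pharmacophore_kernel(m1, m2, sigma):
    """Sum of k_pharmacophore over all pairs of ordered triplets of distinct atoms."""
    return sum(k_pharmacophore(p, q, sigma)
               for p in permutations(m1, 3) for q in permutations(m2, 3))


# Toy molecules (hypothetical coordinates, in angstroms)
m1 = [((0.0, 0.0, 0.0), "C"), ((1.5, 0.0, 0.0), "O"), ((0.0, 1.5, 0.0), "N")]
m3 = [((0.0, 0.0, 0.0), "C"), ((1.5, 0.0, 0.0), "O"), ((0.0, 4.5, 0.0), "N")]
strict = pharmacophore_kernel(m1, m3, sigma=0.1)   # strong spatial constraint
relaxed = pharmacophore_kernel(m1, m3, sigma=2.0)  # looser spatial constraint
```

With a small bandwidth the two distorted molecules are essentially unrelated, whereas a larger bandwidth lets their pharmacophores contribute to the kernel.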

We conclude this section by noting that the class of kernels defined by Equation (15) does not have an explicit
inner product interpretation in general, and in particular using the above parameterization. Nevertheless, this
construction is known to be valid as long as the kernel $k_P$ is a proper kernel function (Haussler, 1999).



6 3D kernel in practice 

In this section, we discuss general considerations related to the application of 3D kernels in practice.

6.1 Implementation and complexity issues 

Without going into technical details, it can be shown that the class of pharmacophore kernels introduced in
Section 5 can be computed by algorithms derived from those used for the computation of 2D kernels. Indeed,
while the 3D structure of a molecule was previously defined as a set of atoms in the 3D space, it can equivalently
be seen as a fully connected labeled (and undirected) graph, with atoms as vertices and inter-atomic distances
as edge labels. Under this representation, it is easy to see that computing the continuous kernel of Equation (15)
can be interpreted as computing a walk kernel restricted to the walks that define cycles of length 3 on the graphs.
Moreover, provided the kernel $k_P$ factorizes along the pairs of pharmacophores to be compared, which is the case
of the kernel proposed in Equation (15), it is easy to show that this kernel can be computed by product-graph
algorithms and simple matrix product operations, for a cubic complexity with respect to the product of the sizes
of the molecules to be compared. While this complexity can be prohibitive for applications involving large
datasets of molecules, the discretized version of the kernel can benefit from fast implementations derived, here
also, from string kernel algorithms and trie-tree structures. We refer the interested reader to Mahé et al. (2006)
for a detailed discussion about the implementation and the computational complexity of these kernels.
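The interpretation in terms of cycles of length 3 can be illustrated with the following sketch, which computes the continuous kernel as the trace of the cube of a weighted product-graph matrix. It assumes the binary atom kernel and a Gaussian RBF distance kernel with $2\sigma^2$ in the denominator; the function names and the toy molecule are hypothetical, and this is not claimed to be the implementation of Mahé et al. (2006).

```python
import math


def matmul(X, Y):
    """Product of two square matrices stored as lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pharmacophore_kernel_product_graph(m1, m2, sigma):
    """Continuous 3D kernel computed as trace(W^3) on the product graph."""
    # Vertices: pairs of identically labeled atoms (binary atom kernel)
    pairs = [(a, b) for a in range(len(m1)) for b in range(len(m2))
             if m1[a][1] == m2[b][1]]
    n = len(pairs)
    W = [[0.0] * n for _ in range(n)]
    for i, (a, b) in enumerate(pairs):
        for j, (c, d) in enumerate(pairs):
            if a != c and b != d:  # distinct atoms in both molecules
                d1 = math.dist(m1[a][0], m1[c][0])
                d2 = math.dist(m2[b][0], m2[d][0])
                W[i][j] = math.exp(-((d1 - d2) ** 2) / (2.0 * sigma ** 2))
    # trace(W^3): total weight of the closed walks of length 3
    W2 = matmul(W, W)
    return sum(W2[i][j] * W[j][i] for i in range(n) for j in range(n))


# Toy molecule with a repeated label (hypothetical coordinates, in angstroms)
m = [((0.0, 0.0, 0.0), "C"), ((1.0, 0.0, 0.0), "C"), ((0.0, 1.0, 0.0), "O")]
value = pharmacophore_kernel_product_graph(m, m, sigma=1.0)
```

Each closed walk of length 3 in the product graph pairs one ordered triplet of distinct atoms from each molecule, so the trace reproduces the double sum of Equation (15) with matrix products only.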

6.2 Kernel parameterization 

In its discretized version, the only parameter entering the definition of the kernel is the number of bins $n$ used to
discretize the inter-atomic distances. As already noted in Section 5, this parameter is of critical importance since
it controls the precision up to which pharmacophores are considered to be identical or not. Unfortunately, this
parameter can hardly be chosen a priori, and Mahé et al. (2006) suggest optimizing it using cross-validation
procedures. In this study, when optimized over the grid {4, 6, 8, ..., 30} for a 0-20 angstrom inter-atomic
distance range, this parameter was usually taken between 20 and 30, which suggests that the matching
between a pair of pharmacophores should be subject to strong spatial constraints. On the other hand, such
fine-grained resolutions have the effect of increasing the impact of the discretization issue illustrated in Figure 10.

Under the parameterization proposed in Section 5, the only parameter entering the definition of the general
kernel of Equation (15) is the bandwidth $\sigma$ of the RBF kernel between inter-atomic distances. In the above
study, small values of $\sigma$, which correspond to strong spatial constraints in the pharmacophore comparison, are
usually selected by cross-validation procedures. While in these cases the discrete and continuous formulations
of the kernel tend to coincide^6, the continuous formulation usually led to better performance in this study.

6.3 Molecule enrichment 

Many mechanisms of interest in tridimensional virtual screening involve specific physicochemical properties
of the molecules. In the case of drug-target interaction for instance, the molecular mechanisms responsible
for the binding are known to depend on a precise 3D complementarity between the drug and the target, from
both the steric and electrostatic perspectives. For this reason, standard pharmacophore-based approaches, and
in particular pharmacophore fingerprints, usually define pharmacophores from atoms or groups of atoms having
particular properties. Typical molecular features of interest are positive and negative charges, high
hydrophobicity, hydrogen donors and acceptors and aromatic rings (Pickett et al., 1996).

Similarly to the introduction of Morgan indices in 2D kernels discussed in Section 3.3, the atom-based
kernel constructions presented in the previous section can naturally be extended to integrate this type of external
information using specific label enrichment schemes. For instance, Mahé et al. (2006) use a simple scheme
where the label of an atom is composed of its type and the sign of its partial charge. Positively-charged, neutral
and negatively-charged atoms of carbon are therefore labeled as $\{C^+, C^0, C^-\}$ in this approach. Alternative
labeling schemes are considered in Azencott et al. (2007), based in particular on element hybridization, where for
instance an sp3 carbon atom is labeled as C.3, and a typing of atoms according to conventional pharmacophoric
features, such as polarity, hydrophobicity, and hydrogen-bond donors and acceptors. These studies show that,
in general, such label enrichments have a positive influence on the subsequent structure-activity relationship
models, while making it possible to drastically reduce the computation cost of the kernels in some cases (Mahé et al.,
2006).
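A label enrichment scheme based on the sign of the partial charge can be sketched as follows. The function name `enrich_label`, the string encoding of the labels and the optional `threshold` below which an atom is treated as neutral are assumptions made for the example; the source only states that the label combines the atom type and the sign of its partial charge.

```python
def enrich_label(element, partial_charge, threshold=0.0):
    """Combine an atom type with the sign of its partial charge.

    Charges whose absolute value does not exceed `threshold` are treated as
    neutral (the threshold is a hypothetical convenience parameter).
    """
    if partial_charge > threshold:
        sign = "+"
    elif partial_charge < -threshold:
        sign = "-"
    else:
        sign = "0"
    return element + sign


# Hypothetical atoms given as (element, partial charge) pairs
atoms = [("C", 0.12), ("C", 0.0), ("C", -0.08), ("O", -0.40)]
labels = [enrich_label(element, charge) for element, charge in atoms]
```

Because enriched labels are more specific, fewer pairs of atoms share the same label, which is consistent with the reduction in computation cost reported above.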



6.4 Conformational analysis 

For real-world applications, considering the tridimensional structure of molecules raises the additional issue
of conformational analysis. Indeed, because of the presence of rotatable bonds, molecules are not static in
the 3D space, but can alternate between several low-energy spatial configurations called conformations.
The mainstream approach to conformational analysis is to represent a molecule as a set of structures, called
conformers, sampled from its class of admissible conformations. On the methodological side, this operation
casts the learning problem into the framework of multi-instance learning, which has attracted considerable
interest in the machine learning community since its initial formulation (Dietterich et al., 1997). The SVM and
kernel approaches lend themselves particularly well to this problem, due, on the one hand, to extensions of the
SVM algorithm (Andrews et al., 2002), and, on the other hand, to the possibility to define kernels between sets
of structures from a kernel between structures (Gärtner et al., 2002). A possible implementation of the latter
approach consists in averaging kernel values over all possible pairs of conformers. While more elaborate schemes can
be adopted, such as that of Blaschko and Hofmann (2006) for instance, this simple configuration was already shown
to be efficient in practice by Azencott et al. (2007).

^6 Indeed, in the extreme case where $\sigma$ tends to 0 and $n$ to $+\infty$, both formulations are equivalent.
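The averaging scheme over pairs of conformers can be sketched as follows. The function name `conformer_set_kernel` and the `base_kernel` argument are hypothetical; any kernel between single 3D structures can be plugged in, and the toy usage below replaces conformers by plain numbers only to keep the example short.

```python
def conformer_set_kernel(conformers1, conformers2, base_kernel):
    """Average of base_kernel over all pairs of conformers of two molecules."""
    total = sum(base_kernel(c1, c2) for c1 in conformers1 for c2 in conformers2)
    return total / (len(conformers1) * len(conformers2))


# Toy usage: "conformers" are numbers and the base kernel is their product
value = conformer_set_kernel([1.0, 3.0], [2.0], lambda a, b: a * b)
```

Dividing by the number of pairs keeps the kernel value comparable between molecules represented by different numbers of conformers.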



7 Discussion 

As a conclusion, it is probably fair to say that the empirical evaluations of the different kernel constructions 
introduced in this paper demonstrate the relevance of the approach based on structure kernels for virtual screen- 
ing. Indeed, on the different tasks they have been tested on, including notably the prediction of high mutagenic- 
ity molecules and drug-target inhibitors, these kernels often compare favorably to state-of-the-art approaches. 
Moreover, because of the intrinsic modularity of kernel methods, this approach offers, to some extent, a unified 
approach to SAR and virtual screening, for two reasons. First, because they circumvent the need of selecting 
and extracting molecular descriptors, these kernels can straightforwardly be used to model different biological 
properties. Second, although we focused in this paper on classification applications, these kernels can be used 
in conjunction with the whole family of algorithms called kernel methods to solve a great variety of tasks which 
are relevant for virtual screening and chemoinformatics applications, such as, for instance, regression, cluster- 
ing and similarity analysis. Concerning its practical use for the screening of large datasets however, it must be 
stressed that the approach based on kernel methods can be computationally demanding, even for relatively small 
datasets. Speeding up SVM and kernel methods for large datasets is currently a topic of interest in the machine 
learning community, and applications in virtual screening on large databases of molecules will certainly benefit 
from the advances in this field. The choice of a particular kernel, or even more importantly, of the 2D or 3D 
representation of molecular structures, should be dictated by the application considered. For example, while it
is widely accepted that several drug-like properties, such as intestinal absorption (Lipinski et al., 2001) or
mutagenicity (King et al., 1996) for instance, can be efficiently deduced from the 2D structure of the molecule, target
binding prediction is known to depend on a precise 3D complementarity between the structures of the drug and
the target, from both the steric and electrostatic perspectives (Böhm et al., 2003). Nevertheless, even in such
problems that intrinsically depend on tridimensional mechanisms, it is not clear that models based on 3D kernels
are more efficient than models based on 2D kernels. This fact is especially emphasized in Azencott et al. (2007),
where 2D kernels are shown to outperform 3D kernels^7 in general, which actually tallies with previous
fingerprint-based studies (Brown and Martin, 1996, 1997).



We see many potential extensions to the general kernel constructions presented above: 

• First, the fact that the models could benefit from simple data enrichment schemes, based, for instance, on 
Morgan indices in the 2D case and partial charges in the 3D case, suggests that the introduction of a more 
thorough chemical knowledge could improve the expressive power of the kernels. In particular, several 
reduced representations of molecular structures exist, defined, for instance, by merging aromatic cycles 
and atoms that are part of the same functional groups in the 2D representation (Gillet et al., 2003), or by
considering generic pharmacophoric features instead of isolated atoms in the 3D case (Pickett et al., 1996).
Applying such transformations in a pre-processing step is most likely to improve the characterization of 
the molecular structures in the kernels, while reducing their computational cost. 

• Other important issues that, in our opinion, would be worth studying in more detail are related to
conformational analysis, and more precisely to the way the conformational space of a molecule is sampled and
multi-instance kernels are defined. Although in their current form 3D kernels tend to be outperformed
by their 2D counterparts (Azencott et al., 2007), we believe that a proper handling of multi-conformers,
together with a higher level of pharmacophoric characterization of molecules, can have a great impact on
virtual screening applications.

• Another possible extension would be to adopt a global representation of molecules and to integrate the
information derived from their 1D, 2D and 3D structures. A possible approach would be to consider a
single kernel defined as a linear combination of kernels for 2D and 3D structures, together with a simple
kernel based on global physicochemical properties. Several methods have been proposed to optimize
such a kernel combination within the framework of support vector machines, based, for instance, on
semidefinite programming (Lanckriet et al., 2004).

• Finally, in the case of drug-target prediction when additional information about the structure of the target
is available, it would be interesting to combine the ligand-based and structure-based approaches to virtual
screening, which would most likely benefit each other in this context.

¹ Including 3D kernels based on multi-conformers.
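To make the idea of a combined kernel concrete, the following minimal Python sketch forms a weighted sum of precomputed Gram matrices. It is an illustration under stated assumptions, not the method of Lanckriet et al. (2004): the function name `combine_kernels`, the toy matrices and the weights are all hypothetical, and the weights are fixed by hand rather than optimized by semidefinite programming. The only property relied upon is that a combination of positive semidefinite kernels with non-negative weights remains positive semidefinite.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def combine_kernels(grams: Sequence[Matrix], weights: Sequence[float]) -> Matrix:
    """Return sum_i weights[i] * grams[i] for square Gram matrices of equal size."""
    if not grams or len(grams) != len(weights):
        raise ValueError("need exactly one weight per Gram matrix")
    if any(w < 0 for w in weights):
        # Non-negative weights keep the combination positive semidefinite.
        raise ValueError("weights must be non-negative")
    n = len(grams[0])
    if any(len(g) != n or any(len(row) != n for row in g) for g in grams):
        raise ValueError("all Gram matrices must be square and of the same size")
    return [
        [sum(w * g[i][j] for g, w in zip(grams, weights)) for j in range(n)]
        for i in range(n)
    ]


# Toy Gram matrices for three molecules (made-up values, for illustration only).
K_2D = [[1.0, 0.5, 0.1], [0.5, 1.0, 0.3], [0.1, 0.3, 1.0]]
K_3D = [[1.0, 0.2, 0.4], [0.2, 1.0, 0.6], [0.4, 0.6, 1.0]]
K_PROP = [[1.0, 0.9, 0.0], [0.9, 1.0, 0.1], [0.0, 0.1, 1.0]]

# Hand-picked weights (an assumption); a kernel-learning method would tune them.
K = combine_kernels([K_2D, K_3D, K_PROP], [0.5, 0.3, 0.2])
```

The resulting matrix `K` can be passed to any SVM implementation that accepts a precomputed Gram matrix, for instance scikit-learn's `SVC(kernel="precomputed")`. In practice the weights would be selected by cross-validation or by a dedicated kernel-learning procedure.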



Last but not least, note that this gentle introduction to kernels for molecular structures and virtual screening
applications only reflects our own view and experience, and was deliberately biased towards our own
developments in this field. Indeed, it must be stressed that, following the pioneering introduction of graph kernels by
Kashima et al. (2003) and Gärtner et al. (2003), several alternative kernel constructions have been proposed in
recent years, among which:



• A graph kernel based on the detection of cyclic and tree patterns, by Horváth et al. (2004).

• A graph kernel based on the count of common paths, by Borgwardt and Kriegel (2005). However, because
it is not possible to consider exhaustive sets of paths, as mentioned in Section 3.2, the kernel construction
is restricted to the sets of shortest paths between pairs of vertices.

• An optimal assignment kernel, based on the idea of optimally assigning the atoms of one molecule to
those of another, by Fröhlich et al. (2005). This kernel is formulated as a sum of kernel values between pairs
of atoms, maximized over all possible assignments of the atoms of the smaller molecule to the atoms of the
larger one. Unfortunately, albeit very natural, this kernel is not positive definite
and might require additional tricks to be used with kernel methods.

• Finally, borrowing techniques from computational geometry, standard walk-based graph kernels have
recently been extended to kernels between tridimensional structures, based on graphs approximating
molecular surfaces, by Azencott et al. (2007).
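As an illustration of the optimal assignment idea described in the list above, the following Python sketch maximizes the sum of atom-pair kernel values over all assignments of the atoms of the smaller molecule to distinct atoms of the larger one. It is a brute-force sketch under simplifying assumptions, not the implementation of Fröhlich et al. (2005): the names `optimal_assignment_kernel` and `dirac`, and the representation of a molecule as a bare list of element symbols, are hypothetical choices made for this example.

```python
from itertools import permutations
from typing import Callable, Sequence


def optimal_assignment_kernel(
    x: Sequence[str],
    y: Sequence[str],
    atom_kernel: Callable[[str, str], float],
) -> float:
    """Maximum, over injective assignments, of the summed atom-pair kernel values."""
    small, large = (x, y) if len(x) <= len(y) else (y, x)
    best = float("-inf")
    # Each permutation selects len(small) distinct atoms of `large`, in order.
    for chosen in permutations(range(len(large)), len(small)):
        score = sum(atom_kernel(a, large[j]) for a, j in zip(small, chosen))
        best = max(best, score)
    return best


def dirac(a: str, b: str) -> float:
    """Hypothetical atom kernel: 1 if the element symbols match, 0 otherwise."""
    return 1.0 if a == b else 0.0


# Heavy atoms only, as plain element symbols (toy encoding, bonds are ignored).
ethanol = ["C", "C", "O"]
acetic_acid = ["C", "C", "O", "O"]
methylamine = ["C", "N"]

k_ea = optimal_assignment_kernel(ethanol, acetic_acid, dirac)  # 3.0
k_em = optimal_assignment_kernel(ethanol, methylamine, dirac)  # 1.0
```

Enumerating all assignments has factorial cost and is only workable for very small molecules; a polynomial-time assignment solver such as the Hungarian algorithm would be used in practice. The sketch also does nothing to address the lack of positive definiteness noted above.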



Together with the references given in the above presentation, this list constitutes, to our knowledge, a
comprehensive view of kernels for molecular structures with applications in virtual screening. As an ending remark, we
would like to mention that open-source implementations of the family of kernels introduced in this paper can be
found within the C++ ChemCpp toolbox, freely and publicly available at http://chemcpp.sourceforge.net.
We hope that this introductory presentation, together with the availability of this software, will help and motivate
the chemoinformatics community to further investigate SVMs and molecular kernels to model structure-activity
relationships.